Please run this code from the main directory.

`cd llm_optimization`


## Training
```
python simple_dpo/train_MO_PO.py --lr 5e-5  --batch_size 4 --P_vec 0.1 0.8 0.1 --epoch 1 --accumulation_steps 5 --beta 0.1 --mu 100  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" --use_lora --dataset "openbmb/UltraFeedback" --use_mix_precision --checkpoint_path "ultra_mu1ultra_mu1003d8B003d" --exclude_cols "truthfulness" "overall" --load_in_X_bit
```

* `lr`: DPO learning rate
* `batch_size`: training batch size
* `P_vec`: preference vector; its length should match the number of objective dimensions
* `epochs`: number of training epochs
* `accumulation_steps`: number of accumulation steps used to compute the training loss
* `beta`: beta in DPO
* `mu`: controls the trade-off between objectives
* `model`: model name
* `use_lora`: whether to use LoRA
* `dataset`: dataset name; all datasets are defined under `data_prep`
* `use_mix_precision`: use mixed precision (float16)
* `checkpoint_path`: checkpoint path
* `load_in_X_bit`: load the model with 8-bit quantization
* `loss_type`: loss type, DPO or PPO
* `seed`: random seed, default 0
* `exclude_cols`: columns to exclude from the dataset
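
Since `P_vec` must have one entry per objective, and `exclude_cols` changes how many objectives remain, it can help to check the two against each other before starting a long run. The snippet below is only a sketch of that check, not code from this repository: the column names in `ASSUMED_COLUMNS` and the helper `check_p_vec` are hypothetical, and the real columns (and the order in which weights are applied to them) are determined by the dataset code under `data_prep`.

```python
# Hypothetical column names for illustration only; the real ones come from data_prep.
ASSUMED_COLUMNS = [
    "instruction_following",
    "honesty",
    "truthfulness",
    "helpfulness",
    "overall",
]


def remaining_objectives(columns, exclude_cols):
    """Return the columns left after dropping exclude_cols, preserving order."""
    excluded = set(exclude_cols)
    return [c for c in columns if c not in excluded]


def check_p_vec(p_vec, columns, exclude_cols):
    """Pair each weight in p_vec with a remaining objective.

    Raises ValueError when the number of weights does not match the number
    of objectives left after exclusion. The pairing order is an assumption.
    """
    objectives = remaining_objectives(columns, exclude_cols)
    if len(p_vec) != len(objectives):
        raise ValueError(
            f"P_vec has {len(p_vec)} entries but {len(objectives)} objectives remain: {objectives}"
        )
    return dict(zip(objectives, p_vec))


# Values taken from the training command above.
weights = check_p_vec([0.1, 0.8, 0.1], ASSUMED_COLUMNS, ["truthfulness", "overall"])
print(weights)
```

With the assumed five columns, excluding `truthfulness` and `overall` leaves three objectives, which matches the three values passed to `--P_vec` in the example command.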


## Evaluation

After training, run the following command to generate responses, compute rewards and the DPO loss, and save the results to pickle files.

```
python simple_dpo/evaluation_safety.py  --model "meta-llama/Llama-3.2-1B-Instruct" --checkpoint "{checkpoint_path}" --use_lora
```

* `evaluation.py` evaluates the model on the UltraFeedback dataset.
* `evaluation_safety.py` evaluates the model on the safety dataset.
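
The saved pickle files can be inspected with the standard `pickle` module. The following is a minimal sketch under stated assumptions: the file name `results.pkl` and the dictionary keys are stand-ins invented for the demonstration, since the actual output paths and object layout are set by the evaluation scripts. The helpers `load_results` and `summarize` are likewise hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_results(path):
    """Load one pickled results object. Only unpickle files you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def summarize(obj):
    """Describe the top-level structure of a loaded object without assuming its layout."""
    if isinstance(obj, dict):
        return {key: type(value).__name__ for key, value in obj.items()}
    return type(obj).__name__


# Demonstration with a stand-in file; the keys below are made up for illustration
# and are NOT the keys written by the evaluation scripts.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "results.pkl"  # hypothetical file name
    with demo_path.open("wb") as f:
        pickle.dump({"responses": ["example"], "rewards": [0.0]}, f)

    loaded = load_results(demo_path)
    print(summarize(loaded))
```

To inspect real output, point `load_results` at one of the pickle files written by the evaluation script and call `summarize` on the result to see what it contains.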